Introduction to a paper by Efron and Thisted: ‘Estimating the number of unseen species: How many words did Shakespeare know?’
Abstract
This paper is the first of two written by Brad Efron and Ron Thisted studying the frequency distribution of words in the Shakespearean canon. The key idea, due to Fisher in the context of species sampling, is simple and elegant. Applied to Shakespeare, the idea appears preposterous: an author has a personal vocabulary of word species represented by a distribution G, and text is generated by sampling from this distribution. Most results do not require successive words to be sampled independently, which leaves room for individual style and context, but stationarity is needed for prediction and inference. The expected number of words that occur x ≥ 1 times in a large sample of n words is
Similar resources
Discussion of : Statistical Analysis of an Archaeological Find
I begin this discussion by quoting a Mosaic law. This is not one that can be found in the Torah, but I know it to be authentic because I heard it from the mouth of Moses himself. The law is, "Statistics is the umpire of the sciences." This law was told me by Lincoln Moses, one of the top applied statisticians of the twentieth century, a real craftsman with data and a master of the application...
Optimal prediction of the number of unseen species
Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42-58], uses n samples to predict the number U of hitherto unseen species that would be observed if [Formula: see text] new samples were collected. Of considerable interest is the lar...
Consistent Estimation of the Number of Unseen Elements
We observe a sample of text of n tokens from a large corpus of written text and note the occurrence of N distinct word types. We then ask what the total number of unseen word types in the population from which the sample was drawn is. The commonly used LNRE (large number of rare events) regime suggests a natural estimator of the number of unseen word types in the population using the relatively...
Estimating the Prediction Function and the Number of Unseen Species in Sampling with Replacement
A sample of N units is taken from a population consisting of an unknown number of species. We are interested in estimating the number of species and the prediction function for future sampling. The prediction function is defined as the expected number of new species that will be found if an additional sample of size tN is taken, for any positive real number t. In this paper we point out that an...
How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs
Many elements contribute to the relative difficulty in acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might) are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...
Publication date: 2013